Novel machine learning methods for computational chemistry
Abstract
The experimental assessment of absorption, distribution, metabolism, excretion, toxicity, and related physicochemical properties of small molecules is among the most time- and cost-intensive tasks in chemical research. Computational approaches, such as machine learning methods, offer an economical alternative for predicting these properties; however, the limited accuracy and irregular error rate of these predictions restrict their use within the research process. This thesis introduces and evaluates new ideas to enhance the acceptance and usage of kernel-based machine learning models in chemical research. The first part of the thesis investigates different approaches to improve the quality of machine learning predictions in drug discovery. By taking the precise chemical application into account, we derive a new virtual screening algorithm, StructRank, which focuses on the correct ranking of compounds with high binding affinities. Then, the limits of single and ensemble learning methods are analyzed in the context of hERG inhibition. Since the drug discovery process often requires the assessment of new chemical series that differ from previously examined structures, we introduce and evaluate a clustered cross-validation scheme that stresses the extrapolation capacity of models. We also present a local bias correction that incorporates new measurements efficiently and without the need for model retraining. The second part of the thesis is concerned with two different approaches to assess the reliability and interpretability of kernel-based prediction models. The first approach builds on the visual interpretation of predictions based on the most relevant training compounds. A compact method to calculate the impact of training compounds on single predictions is derived, and the resulting visualizations are evaluated in a questionnaire study. The second approach addresses interpretability in terms of chemical features. Here, local gradients are employed to measure the local influence of specific chemical features on a predicted property. We demonstrate the capacity of this approach to identify local as well as global trends in Ames mutagenicity data and to reveal unique characteristics of compound classes such as steroids. Finally, we show that the potential of the developed methods extends beyond drug discovery by using local gradients to enhance the assessment of reaction rates in transition state theory. While computational chemistry remains a challenging field of application for machine learning, the present work introduces methods to improve and assess the quality of machine learning predictions in order to increase the usage of these methods in chemical research.
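The idea of a local gradient for a kernel-based model can be illustrated with a minimal sketch. The abstract does not give the thesis's exact formulation, so the following assumes a model written as a Gaussian kernel expansion, f(x) = sum_i alpha_i k(x, x_i); the function names, the toy training points, the coefficients, and the kernel width are all hypothetical and chosen only for illustration. Each component of the gradient indicates how strongly the prediction at the query point would change if the corresponding feature were increased slightly.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def predict(x, train_points, alphas, sigma):
    """Kernel expansion f(x) = sum_i alpha_i * k(x, x_i)."""
    return sum(a * gaussian_kernel(x, z, sigma)
               for a, z in zip(alphas, train_points))


def local_gradient(x, train_points, alphas, sigma):
    """Analytic gradient of f at x.

    For the Gaussian kernel:
    df/dx_j = sum_i alpha_i * k(x, x_i) * (x_ij - x_j) / sigma^2
    """
    grad = [0.0] * len(x)
    for a, z in zip(alphas, train_points):
        weight = a * gaussian_kernel(x, z, sigma) / sigma ** 2
        for j in range(len(x)):
            grad[j] += weight * (z[j] - x[j])
    return grad


# Hypothetical toy model: three "training compounds" described by two
# features, with made-up expansion coefficients (not fitted to real data).
train_points = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
alphas = [0.5, -1.2, 0.8]
sigma = 1.0

query = [0.3, 0.7]
grad = local_gradient(query, train_points, alphas, sigma)
print("prediction:", predict(query, train_points, alphas, sigma))
print("local gradient:", grad)
```

In this sketch the coefficients are supplied directly rather than fitted; in practice they would come from training a kernel method such as kernel ridge regression or a Gaussian process, and the gradient formula would change with the choice of kernel.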
Publication date: 2012